Finding Alternative Translations in a Large Corpus of Movie Subtitles

Author

Jörg Tiedemann

Abstract

OpenSubtitles.org provides a large collection of user-contributed subtitles in various languages for movies and TV programs. Subtitle translations are valuable resources for cross-lingual studies and machine translation research. A less explored feature of the collection is the inclusion of alternative translations, which can be very useful for training paraphrase systems or collecting multi-reference test suites for machine translation. However, differences in translation may also be due to misspellings, incomplete or corrupt data files, or wrongly aligned subtitles. This paper reports our efforts in recognising and classifying alternative subtitle translations with language-independent techniques. We use time-based alignment with lexical re-synchronisation techniques and BLEU score filters, and sort alternative translations into categories using edit distance metrics and heuristic rules. Our approach produces large numbers of sentence-aligned translation alternatives for over 50 languages, provided via the OPUS corpus collection.
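As a rough illustration of sorting alternative translations into categories with an edit distance metric and simple heuristic rules, the following minimal sketch compares two versions of the same subtitle line. It is not the paper's implementation: the category names, the `minor_ratio` threshold, and the helper functions `normalise` and `classify_pair` are all hypothetical, and the punctuation stripping covers ASCII punctuation only.

```python
import string


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalise(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rule)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def classify_pair(a: str, b: str, minor_ratio: float = 0.2) -> str:
    """Assign a hypothetical category to two alternative translations.

    minor_ratio is an illustrative threshold, not a value from the paper.
    """
    if a == b:
        return "identical"
    na, nb = normalise(a), normalise(b)
    if na == nb:
        return "punctuation_or_case"
    # na != nb here, so at least one string is non-empty.
    ratio = edit_distance(na, nb) / max(len(na), len(nb))
    if ratio <= minor_ratio:
        return "minor_edit"  # e.g. a possible spelling variant
    return "substantial_difference"  # e.g. a candidate paraphrase


if __name__ == "__main__":
    print(classify_pair("Hello, world!", "hello world"))
    print(classify_pair("I recieve it", "I receive it"))
    print(classify_pair("Get out of here!", "Leave now."))
```

A length-normalised distance is used here so that one threshold can apply to both short and long subtitle lines. A pair flagged as a substantial difference could be a genuine paraphrase or a misalignment, which is why the abstract pairs edit distance with alignment and BLEU score filtering.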

Similar resources

The Importance of Downtoners in English Writing and Translation -with Reference to Chinese Movie Subtitle Translations

This paper reports on a study of the use of downtoners, an important hedging device, by Chinese translators when translating Chinese subtitles into English. The study was carried out by making a corpus-based analysis of patterns of using downtoners in a Chinese subtitle corpus and an English subtitle corpus. Features of overuse and underuse of downtoners in subtitle translation by the Chinese t...

Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora

This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church’s sentence alignment algorithm (1993). However, our algorithm not only relies on character length information, but also uses subtitle-timing information, which is encoded in the subtitle files. Timing is highly correl...

Phrase-Based Machine Translation based on Simulated Annealing

In this paper, we propose a new phrase-based translation model based on inter-lingual triggers. The originality of our method is double. First we identify common source phrases. Then we use inter-lingual triggers in order to retrieve their translations. Furthermore, we consider the way of extracting phrase translations as an optimization issue. For that we use simulated annealing algorithm to f...

PersianSMT: A first attempt to English-Persian Statistical Machine Translation

In this paper, an attempt to develop a phrase-based statistical machine translation between English and Persian languages (PersianSMT) is described. Creation of the largest English-Persian parallel corpus yet presented by the use of movie subtitles is a part of this work. Two major goals are followed here: the first one is to show the main problems observed in the output of the PersianSMT syste...

OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR erro...

Publication date: 2016
